Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

To be fair I’m not a huge fan of statistics and computer sciences. I’m working as a doctor and doing my PhD on the side. I did one of such courses more than two years ago and i have forgotten most of it by now. There aren’t many courses available to be done fully remotely. Thus I’m happy to have found this one on Sisu. I hope to refresh what I have already done in the past and to learn some new skills that will enable me to analyze data for my research.

The exercise set was a good way to start getting familiar with the course content and checking weather everything is working as it should. The most difficult part for me is organizing big sets of data with different variables as in last section of the exercise. The book gives a good insight to the basic commands in Rstudio. Nowadays it is also very easy to find support simply by googling the issue. So far I liked the instructions on the Moodle platform. Everything seem to be working fine so far. Looking forward to the real tasks.

https://github.com/konradsopyllo/IODS-project.git

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec  6 18:51:31 2023"

The text continues here.


Assignment 2

Konrad Sopyllo

The data presented below is based on a survey and the data was collected 3.12.2014 - 10.1.2015 in Florence,Italy. The data was translated in Finnish by Kimmo Vehkalahti and Liisa Myyry. The survey was based originally on a finnish study concerning the progress and satisfaction related to the double phased study programme in technical universities.

#Data wrangling

#In this data wrangling part we will focus brielfy on creating the data set as in the csv-file in the data-folder.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url<-"http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt"
data=read_tsv(url)
## Rows: 183 Columns: 60
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr  (1): gender
## dbl (59): Aa, Ab, Ac, Ad, Ae, Af, ST01, SU02, D03, ST04, SU05, D06, D07, SU0...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Creating an analysis_dataset with the variables gender, age, attitude, deep, stra, surf and points by combining the values in the learning2014 data. Note that Stra, surf and deep consist of multiple variables.

collumns <- c("gender", "Age", "Attitude", "Points")

analysis_dataset <- data[collumns]

Other variables explained:

d_sm Seeking Meaning
d_ri Relating Ideas
d_ue Use of Evidence
su_lp Lack of Purpose
su_um Unrelated Memorising
su_sb Syllabus-boundness
st_os Organized Studying
st_tm Time Management
Deep Deep approach
Surf Surface approach
Stra Strategic approach

#Stra = st_os+st_tm = ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28

stra_collumns = unlist(str_split(c("ST01+ST09+ST17+ST25+ST04+ST12+ST20+ST28"), "\\+"))

#surf = su_lp+su_um+su_sb = SU02+SU10+SU18+SU26 + SU05+SU13+SU21+SU29 + SU08+SU16+SU24+SU32

surf_collumns = unlist(str_split(c("SU02+SU10+SU18+SU26+SU05+SU13+SU21+SU29+SU08+SU16+SU24+SU32"), "\\+"))

#deep = d_sm+d_ri+d_ue = D03+D11+D19+D27 + D07+D14+D22+D30 + D06+D15+D23+D31

deep_collumns = unlist(str_split(c("D03+D11+D19+D27+D07+D14+D22+D30+D06+D15+D23+D31"), "\\+"))

#Adding stra, deep and surf by creating scaled variables using rowMeans-function.

analysis_dataset["stra"] = rowMeans(select(data, one_of(stra_collumns)))
analysis_dataset["deep"] = rowMeans(select(data, one_of(deep_collumns)))
analysis_dataset["surf"] = rowMeans(select(data, one_of(surf_collumns)))

#filtering out rows where Points variable = 0

analysis_dataset = analysis_dataset[analysis_dataset$Points != 0,]
dim(analysis_dataset)
## [1] 166   7

#As in the exercise provided the file has 166 observations and 7 variables

#Saving the analysis_dataset as a csv-file

library(readr)
write_csv(analysis_dataset, "data/learning2014.csv")

#Showing how to read the csv-file

loaded_data <- read_csv("data/learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): Age, Attitude, Points, stra, deep, surf
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(loaded_data)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ Age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num [1:166] 37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   Age = col_double(),
##   ..   Attitude = col_double(),
##   ..   Points = col_double(),
##   ..   stra = col_double(),
##   ..   deep = col_double(),
##   ..   surf = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(loaded_data)
## # A tibble: 6 × 7
##   gender   Age Attitude Points  stra  deep  surf
##   <chr>  <dbl>    <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 F         53       37     25  3.38  3.58  2.58
## 2 M         55       31     12  2.75  2.92  3.17
## 3 F         49       25     24  3.62  3.5   2.25
## 4 M         53       35     10  3.12  3.5   2.25
## 5 M         49       37     22  3.62  3.67  2.83
## 6 F         38       38     21  3.62  4.75  2.42

##Data analysis

In this section we will focus further on analysing the data itself. The brief description of the contents of the data have been presented in the beginning of the file.

#Creating the correlation matrix

library("Hmisc")
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library("corrplot")
## corrplot 0.92 loaded
library(ggplot2)

#Gender correlation

gender_graph <- ggplot(data.frame(analysis_dataset), aes(x = gender)) +
  geom_bar(fill = "lightgreen")
print(gender_graph)

#Points correlation

points_dist <- ggplot(data.frame(analysis_dataset), aes(x =Points)) +
  geom_density(fill = "lightgreen")
print(points_dist)

#Age correlation

age_dist <- points_dist <- ggplot(data.frame(analysis_dataset), aes(x = Age)) +
  geom_density(fill = "lightgreen")
print(points_dist)

#deep correlation

deep_dist <- ggplot(data.frame(analysis_dataset), aes(x = deep)) +
  geom_density(fill = "lightgreen")
print(deep_dist)

#surf correlation

surf_dist <- ggplot(data.frame(analysis_dataset), aes(x = surf)) +
  geom_density(fill = "lightgreen")
print(surf_dist)

#stra correlation

stra_dist <- ggplot(data.frame(analysis_dataset), aes(x = stra)) +
  geom_density(fill = "lightgreen")
print(stra_dist)

#Creating a correlation graph of variables:“Attitude”,“deep”,“stra”,“surf”,“Points” using Pearson

correlations = c("Attitude","deep","stra","surf","Points")
analysis_cor = cor(analysis_dataset[correlations])
corrplot(analysis_cor)

plot(analysis_dataset)

#By having a look at the correlation analysis we can identify the factors with the highest correspondance. Those will be further explained below.

##Attitude vs Points

p1 <- ggplot(data.frame(analysis_dataset), aes(x = Attitude, y = Points))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3+ggtitle("Attitude vs Points")
print(p4)
## `geom_smooth()` using formula = 'y ~ x'

model <- lm(analysis_dataset$Points~analysis_dataset$Attitude)
summary(model)
## 
## Call:
## lm(formula = analysis_dataset$Points ~ analysis_dataset$Attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               11.63715    1.83035   6.358 1.95e-09 ***
## analysis_dataset$Attitude  0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

#p-value being lowest in this correlation graph.

#Other way of doing the same thing

library(ggplot2)
qplot(Attitude, Points, data = analysis_dataset) + geom_smooth(method = "lm")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

my_model <- lm(Points ~ Attitude, data = analysis_dataset)

#surf vs deep

p1 <- ggplot(data.frame(analysis_dataset), aes(x = deep, y = surf))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3+ggtitle("deep vs surf")
print(p4)
## `geom_smooth()` using formula = 'y ~ x'

model <- lm(analysis_dataset$surf~analysis_dataset$deep)
summary(model)
## 
## Call:
## lm(formula = analysis_dataset$surf ~ analysis_dataset$deep)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.60164 -0.34081  0.01032  0.30539  1.12548 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.92426    0.26235  14.958  < 2e-16 ***
## analysis_dataset$deep -0.30902    0.07051  -4.383 2.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5019 on 164 degrees of freedom
## Multiple R-squared:  0.1048, Adjusted R-squared:  0.09939 
## F-statistic: 19.21 on 1 and 164 DF,  p-value: 2.085e-05

#p-value being second lowest in this comparison

#Attitude vs surf

p1 <- ggplot(data.frame(analysis_dataset), aes(x = Attitude, y = surf))
p2 <- p1 + geom_point()
p3 <- p2 + geom_smooth(method = "lm")
p4 <- p3+ggtitle("Attitude vs surf")
print(p4)
## `geom_smooth()` using formula = 'y ~ x'

model <- lm(analysis_dataset$surf~analysis_dataset$Attitude)
summary(model)
## 
## Call:
## lm(formula = analysis_dataset$surf ~ analysis_dataset$Attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.08644 -0.40136 -0.01364  0.35040  1.50259 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.18686    0.17968  17.737   <2e-16 ***
## analysis_dataset$Attitude -0.01272    0.00557  -2.283   0.0237 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5222 on 164 degrees of freedom
## Multiple R-squared:  0.03082,    Adjusted R-squared:  0.02491 
## F-statistic: 5.214 on 1 and 164 DF,  p-value: 0.02368

#p-value being third lowest in this comparison. The abovementioned three statistical models (Attitude vs surf, surf vs deep, Attitude vs Points) show a rather significant coefficient. In the linear model especially Attitude (predictor variable) vs Points (response variable) have a statistically strong correlation.

#Creating a multiple variable model

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
ggpairs(analysis_dataset, lower = list(combo = wrap("facethist", bins = 20)))

my_model2 <- lm(Points ~ Attitude + stra, data = analysis_dataset)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = analysis_dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
my_model2 <- lm(Points ~ Attitude + stra, data = analysis_dataset)

#Linear model with multiple variables

lm(my_model2)
## 
## Call:
## lm(formula = my_model2)
## 
## Coefficients:
## (Intercept)     Attitude         stra  
##      8.9729       0.3466       0.9137
par(mfrow = c(2, 2))

#Residuals vs Fitted values (which = 1)

plot(my_model2, which = 1)

#Normal QQ-plot (which = 2)

plot(my_model2, which = 2)

#Residuals vs Leverage (which = 5)

plot(my_model2, which = 5)

#Multiple explanatory variables all in one

plot(my_model2, which = c(1, 2, 5))

In the Residual vs Fitted plot the points are fairly randomly scattered and follow the line. It suggest the assumption is reasonable. In the QQ plot there is some deviation from the norm indicating departure from normality. Residuals vs Leverage on the other hand shows a uneven distribution suggesting heteroscedasticity.

#Making predictions

m <- lm(Points ~ Attitude, data = analysis_dataset)
summary(m)
## 
## Call:
## lm(formula = Points ~ Attitude, data = analysis_dataset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

As we can se above the Adjusted R-squared is 0.1856. This is not a sign of a good model.

#New observations

new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(Attitude = new_attitudes)
print(new_data)
##        Attitude
## Mia         3.8
## Mike        4.4
## Riikka      2.2
## Pekka       2.9

#Predicting exam points based on the model created above.

predictions <- predict(m, newdata = new_data)
print(predictions)
##      Mia     Mike   Riikka    Pekka 
## 12.97683 13.18836 12.41275 12.65954

Here we go again…


Assignment 3

Konrad Sopyllo

The data presented below concers student performance in Portugese high school education. It focuses mainly onperformance in two distinct subjects: Mathematics (mat) and Portuguese language (por). The data was collected by Cortez and Silva in 2008.

url <- "https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv"
alc_data <- read.csv(url)

variable_names <- names(alc_data)
cat("Variable Names:", paste(variable_names, collapse = ", "), "\n\n")
## Variable Names: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, schoolsup, famsup, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, failures, paid, absences, G1, G2, G3, alc_use, high_use

In this section we will choose four random factors and analyse their correlation with alcohol consumption. First we will make a hypothesis on the result and then calculate it. In addition we will interpret coefficients of the models and their odds ratios

#Hypothesis 1. Romantic relationship is correlated with higher alcohol use.

library(tidyverse)
gathered_data <- alc_data %>%
  gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)

ggplot(gathered_data, aes(x = romantic, fill = variable, y = alcohol_consumption)) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
  labs(title = "Average Alcohol Consumption vs Romantic relationship",
       x = "romantic",
       y = "Average Alcohol Consumption") +
  scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
                    labels = c("Workday", "Weekend")) +
  theme_minimal()

logistic_model <- glm(high_use ~ Dalc + Walc + romantic, 
                      data = alc_data, 
                      family = binomial,
                      maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ Dalc + Walc + romantic, family = binomial, 
##     data = alc_data, maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.432e+02  4.161e+05  -0.001        1
## Dalc         5.423e+01  1.122e+05   0.000        1
## Walc         5.397e+01  9.790e+04   0.001        1
## romanticyes -2.844e-01  9.737e+04   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5204e+02  on 369  degrees of freedom
## Residual deviance: 3.4622e-10  on 366  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
## 
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios, 
           Lower_CI = conf_intervals[, 1], 
           Upper_CI = conf_intervals[, 2])
##                Odds_Ratio Lower_CI Upper_CI
## (Intercept) 2.333493e-106        0        0
## Dalc         3.569417e+23        0      Inf
## Walc         2.755430e+23        0      Inf
## romanticyes  7.524331e-01        0       NA

Romantic relationship decreases the chances of high alcohol use. The odds ratio is 0.8 which means that the chances are smaller when in romantic realtionship.

#Hypothesis 2. more activities is related to higher alcohol use.

gathered_data <- alc_data %>%
  gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)

ggplot(gathered_data, aes(x = activities, fill = variable, y = alcohol_consumption)) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
  labs(title = "Average Alcohol Consumption vs Activities",
       x = "Activities",
       y = "Average Alcohol Consumption") +
  scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
                    labels = c("Workday", "Weekend")) +
  theme_minimal()

#regression analysis

logistic_model <- glm(high_use ~ Dalc + Walc + activities, 
                      data = alc_data, 
                      family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ Dalc + Walc + activities, family = binomial, 
##     data = alc_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -207.2808 56619.7972  -0.004    0.997
## Dalc             46.2209 15226.4671   0.003    0.998
## Walc             45.9976 13263.2386   0.003    0.997
## activitiesyes    -0.1948 12574.1620   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5204e+02  on 369  degrees of freedom
## Residual deviance: 1.8847e-08  on 366  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 25
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
## 
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios, 
           Lower_CI = conf_intervals[, 1], 
           Upper_CI = conf_intervals[, 2])
##                 Odds_Ratio      Lower_CI      Upper_CI
## (Intercept)   9.530179e-91  0.000000e+00  0.000000e+00
## Dalc          1.184331e+20 1.865684e-115 7.518102e+154
## Walc          9.473186e+19           Inf           Inf
## activitiesyes 8.230000e-01  0.000000e+00            NA
logistic_model <- glm(high_use ~ Dalc + Walc + goout, 
                      data = alc_data, 
                      family = binomial,
                      maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ Dalc + Walc + goout, family = binomial, 
##     data = alc_data, maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.434e+02  4.263e+05  -0.001        1
## Dalc         5.424e+01  1.121e+05   0.000        1
## Walc         5.399e+01  9.967e+04   0.001        1
## goout       -1.091e-03  4.686e+04   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5204e+02  on 369  degrees of freedom
## Residual deviance: 3.4654e-10  on 366  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
## 
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios, 
           Lower_CI = conf_intervals[, 1], 
           Upper_CI = conf_intervals[, 2])
##                Odds_Ratio Lower_CI Upper_CI
## (Intercept) 1.980652e-106        0        0
## Dalc         3.582535e+23        0      Inf
## Walc         2.813755e+23        0      Inf
## goout        9.989097e-01       NA      Inf

Extracurricular activities decrease alchol consumption. -p-value is low thus the result is significant.

#Hypothesis 3. Amount of going out is correlated with higher alcohol use

gathered_data <- alc_data %>%
  gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)

ggplot(gathered_data, aes(x = goout, fill = variable, y = alcohol_consumption)) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
  labs(title = "Average Alcohol Consumption vs Going out",
       x = "Going out",
       y = "Average Alcohol Consumption") +
  scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
                    labels = c("Workday", "Weekend")) +
  theme_minimal()

logistic_model <- glm(high_use ~ Dalc + Walc + goout, 
                      data = alc_data, 
                      family = binomial,
                      maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ Dalc + Walc + goout, family = binomial, 
##     data = alc_data, maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.434e+02  4.263e+05  -0.001        1
## Dalc         5.424e+01  1.121e+05   0.000        1
## Walc         5.399e+01  9.967e+04   0.001        1
## goout       -1.091e-03  4.686e+04   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5204e+02  on 369  degrees of freedom
## Residual deviance: 3.4654e-10  on 366  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
## 
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios, 
           Lower_CI = conf_intervals[, 1], 
           Upper_CI = conf_intervals[, 2])
##                Odds_Ratio Lower_CI Upper_CI
## (Intercept) 1.980652e-106        0        0
## Dalc         3.582535e+23        0      Inf
## Walc         2.813755e+23        0      Inf
## goout        9.989097e-01       NA      Inf

The more you go out, the more you drink alcohol. -Given the p-valuen (0.02) the results are statistically significant. -One unit increase in going out, alcohol consumption increases by 1.3

#Hypothesis 4. increased internet usage is correlated with smaller alcohol consumption.

gathered_data <- alc_data %>%
  gather(key = "variable", value = "alcohol_consumption", Dalc, Walc)

ggplot(gathered_data, aes(x = internet, fill = variable, y = alcohol_consumption)) +
  geom_bar(position = "dodge", stat = "summary", fun = "mean", color = "black", alpha = 0.7) +
  labs(title = "Average Alcohol Consumption vs Internet Use",
       x = "Internet Use",
       y = "Average Alcohol Consumption") +
  scale_fill_manual(values = c("red", "green"), name = "Alcohol Consumption",
                    labels = c("Workday", "Weekend")) +
  theme_minimal()

logistic_model <- glm(high_use ~ Dalc + Walc + internet, 
                      data = alc_data, 
                      family = binomial,
                      maxit = 1000)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
## 
## Call:
## glm(formula = high_use ~ Dalc + Walc + internet, family = binomial, 
##     data = alc_data, maxit = 1000)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -243.506 429674.814  -0.001        1
## Dalc            54.231 112021.326   0.000        1
## Walc            53.991  97780.360   0.001        1
## internetyes      0.148 131720.810   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4.5204e+02  on 369  degrees of freedom
## Residual deviance: 3.4659e-10  on 366  degrees of freedom
## AIC: 8
## 
## Number of Fisher Scoring iterations: 29
odds_ratios <- exp(coef(logistic_model))
conf_intervals <- exp(confint(logistic_model))
## Waiting for profiling to be done...
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
cat("\nOdds Ratios and 95% Confidence Intervals:\n")
## 
## Odds Ratios and 95% Confidence Intervals:
data.frame(Odds_Ratio = odds_ratios, 
           Lower_CI = conf_intervals[, 1], 
           Upper_CI = conf_intervals[, 2])
##                Odds_Ratio Lower_CI Upper_CI
## (Intercept) 1.764963e-106        0        0
## Dalc         3.566017e+23        0      Inf
## Walc         2.804818e+23      Inf      Inf
## internetyes  1.159493e+00       NA      Inf

Internet use is increased during alcohol consumption. Thus the assumption is false. -Given the p-valuen (0.02) the results are statistically significant. -One unit increase in internet usage, alcohol consumption increases by 1.2

#In the last section we do a prediction power model vs simple guessing of the results.

predictions <- predict(logistic_model, type = "response") > 0.5

conf_matrix <- table(Actual = alc_data$high_use, Predicted = predictions)

print(conf_matrix)
##        Predicted
## Actual  FALSE TRUE
##   FALSE   259    0
##   TRUE      0  111
training_error <- sum(alc_data$high_use != predictions) / nrow(alc_data)
cat("\nTraining Error:", training_error, "\n")
## 
## Training Error: 0
library(ggplot2)
ggplot(alc_data, aes(x = as.factor(high_use), fill = as.factor(predictions))) +
  geom_bar(position = "dodge") +
  labs(title = "Actual vs. Predicted Values",
       x = "Actual Values",
       y = "Count",
       fill = "Predicted Values") +
  theme_minimal()

training_error <- sum(alc_data$high_use != predictions) / nrow(alc_data)
cat("\nTraining Error:", training_error, "\n")
## 
## Training Error: 0
majority_class <- as.numeric(names(sort(table(alc_data$high_use), decreasing = TRUE)[1]))
## Warning: NAs introduced by coercion
simple_guess_predictions <- rep(majority_class, nrow(alc_data))


model_accuracy <- sum(alc_data$high_use == predictions) / nrow(alc_data)


simple_guess_accuracy <- sum(alc_data$high_use == simple_guess_predictions) / nrow(alc_data)

cat("Model Accuracy:", model_accuracy, "\n")
## Model Accuracy: 1
cat("Simple Guessing Accuracy:", simple_guess_accuracy, "\n")
## Simple Guessing Accuracy: NA

The model has 100% success rate and no false results. There are also 0 incorrectly typed individuals.

#Bonus I

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
alc_data$high_use <- as.factor(alc_data$high_use)

set.seed(123)

ctrl <- trainControl(method = "cv", number = 10)

cv_model <- train(high_use ~ Dalc + Walc + romantic, data = alc_data, method = "glm", family = binomial, trControl = ctrl)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
print(cv_model)
## Generalized Linear Model 
## 
## 370 samples
##   3 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 333, 333, 333, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1
previous_model_error <- 0.26  

cv_model_error <- 1 - cv_model$results$Accuracy

cat("\nPrevious Model Error:", previous_model_error, "\n")
## 
## Previous Model Error: 0.26
cat("Cross-Validation Model Error:", mean(cv_model_error), "\n")
## Cross-Validation Model Error: 0

Bonus I: Performing a 10 fold cross validation and checking the test performance.

In the first section, the coding contains creation of the training control. Since both accuracy and Kappa are 1, this means the model has a perfect classification and agreement of outome respectively. It also performs better than the prevous model, since the error is 0.

#Bonus II


Assignment 4

Konrad Sopyllo

The dataset used for this weeks assignment describes the housing values in Suburbs of Boston. The set is shown below. The meaning of the factors are as follows:crim = per capita crime rate by town, zn = proportion of residential land zoned for lots over 25,000 sq.ft., indus = proportion of non-retail business acres per town, chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise), nox = nitrogen oxides concentration (parts per 10 million), rm = average number of rooms per dwelling, age = proportion of owner-occupied units built prior to 1940, dis = weighted mean of distances to five Boston employment centres, rad = index of accessibility to radial highways, tax = full-value property-tax rate per $10,000, ptratio = pupil-teacher ratio by town, black = 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town, lstat = lower status of the population (percent), medv = median value of owner-occupied homes in $1000s

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Showing a graphical overview of the data and correlations. It is used to predict the value of owner-occupied homes.

par(mar = c(2, 2, 1, 1))
par(mfrow = c(4, 4))
for (i in 1:14) {
  plot(Boston[, i], main = names(Boston)[i], col = "lightblue", pch = 20)
}
par(mfrow = c(1, 1))

par(mar = c(5, 4, 4, 2) + 0.1) 

#standardizing the data set and creating a training set.

scaled_data <- as.data.frame(scale(Boston))

summary(scaled_data)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
scaled_data$crime_category <- cut(scaled_data$crim, breaks = quantile(scaled_data$crim), labels = FALSE)

scaled_data <- scaled_data[, -which(colnames(scaled_data) == "crim")]

set.seed(123)

train_indices <- sample(1:nrow(scaled_data), 0.8 * nrow(scaled_data))
train_set <- scaled_data[train_indices, ]
test_set <- scaled_data[-train_indices, ]

Mean of each variable is now close to 0 and standard deviation is close to 1.

Linear discriminant analysis

library(ggplot2)

lda_model <- lda(crime_category ~ ., data = train_set)

lda_data <- data.frame(predict(lda_model, train_set)$x, crime_category = train_set$crime_category)

ggplot(lda_data, aes(x = LD1, y = LD2, color = crime_category)) +
  geom_point() +
  ggtitle("LDA Biplot") +
  xlab("LD1") +
  ylab("LD2") +
  theme_minimal()

In the plot shown above the x and y axes represent the linear discriminant functions. They are linear combinations of the original variables.
LD1: the first linear discriminant function maximizes the separation between different categories LD2:The second linear discriminant function, which is orthogonal to the former, captures additional variation not explained by the LD1. In the plot we can crealy see that the higher crime categories (mainly 4) are separated from the rest. This means less violent crimes can be separated from the rest.

#Predicting classes with he LDA model

actual_crime_categories <- test_set$crime_category

test_set <- test_set[, -which(colnames(test_set) == "crime_category")]

predicted_classes <- predict(lda_model, newdata = test_set)$class

confusion_matrix <- table(actual = actual_crime_categories, predicted = predicted_classes)

print(confusion_matrix)
##       predicted
## actual  1  2  3  4
##      1  9 10  4  0
##      2  2 17  6  0
##      3  1  9 13  2
##      4  0  0  1 27
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Overall Accuracy:", round(accuracy, 4), "\n")
## Overall Accuracy: 0.6535

The plot above consists of tru positive, true negative, false positive and false negative values. Hence numbers 1-4. Within the plot we can se instances fitting these categories. The overall accuracy is (TP+TN)/Total. In this case the overall accuracy is 65% whis is not great.

#K-means algorithm

data(Boston)

scaled_data <- scale(Boston)

distances <- dist(scaled_data)

set.seed(123)
k_values <- 2:10
kmeans_results <- lapply(k_values, function(k) kmeans(scaled_data, centers = k))

wss <- sapply(kmeans_results, function(km) sum(km$withinss))
plot(k_values, wss, type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of Clusters (k)", ylab = "Within-Cluster Sum of Squares (WSS)",
     main = "Elbow Method for Optimal k")

optimal_k <- k_values[which.min(wss)]
optimal_kmeans <- kmeans(scaled_data, centers = optimal_k)

pairs(scaled_data, col = optimal_kmeans$cluster)

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
scaled_data_df <- as.data.frame(scaled_data)

pairs(scaled_data_df, col = optimal_kmeans$cluster)

fviz_cluster(optimal_kmeans, data = scaled_data_df, geom = "point",
             ggtheme = theme_minimal(), ellipse.type = "convex", ellipse.level = 0.68)

First the variables in Boston dataset are being standardized. In order to alculate the distances between the observations based on the standardized variables we use dist(). The elbow method is used to identify the optimal number of clusters by using the within-cluster sum of squares (WSS). Then k-means algorithm is run again with the optimal number of clusters. For better visualisation the data is presented above in the graphs. As shown in the first graph, 10 is good number for the amount of clusters. In the cluster plot on the other hand we can see some clusters being more distant from another. In general the seem to overlap with neighbouring ones.

#Bonus I -Performing a k-means on the original data with >2 clusters.

library(MASS)
library(cluster)

data(Boston)

scaled_data <- scale(Boston)

set.seed(123)
num_clusters <- 3
kmeans_results <- kmeans(scaled_data, centers = num_clusters)

lda_model <- lda(factor(kmeans_results$cluster) ~ ., data = Boston)

lda_scores <- predict(lda_model)$x

plot(lda_scores[, 1], lda_scores[, 2], col = as.factor(kmeans_results$cluster), pch = 16,
     main = "LDA Biplot-Like Visualization", xlab = "LD1", ylab = "LD2")

points(lda_model$means[, 1], lda_model$means[, 2], col = 1:num_clusters, pch = 3, cex = 2)
legend("topright", legend = 1:num_clusters, col = 1:num_clusters, pch = 3, title = "Clusters")

As we can see in the graph above the cluster 1 is certainly more detached than the others. Cluster 2 and 3 are more or less grouped together but there is still a rather strong separation visible at LD2 arounf point 0.

#Bonus II


Assignment 5

Konrad Sopyllo

date()
## [1] "Wed Dec  6 18:51:59 2023"

To avoid errors I will use Kimmos ready data.

library(readr)
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Data relations

rownames(human) <- human$Country
## Warning: Setting row names on a tibble is deprecated.
human$Country <- NULL  

head(human)
## # A tibble: 6 × 8
##   Edu2.FM Labo.FM Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth Parli.F
##     <dbl>   <dbl>    <dbl>   <dbl> <dbl>   <dbl>     <dbl>   <dbl>
## 1   1.01    0.891     81.6    17.5 64992       4       7.8    39.6
## 2   0.997   0.819     82.4    20.2 42261       6      12.1    30.5
## 3   0.983   0.825     83      15.8 56431       6       1.9    28.5
## 4   0.989   0.884     80.2    18.7 44025       5       5.1    38  
## 5   0.969   0.829     81.6    17.9 45435       6       6.2    36.9
## 6   0.993   0.807     80.9    16.5 43919       7       3.8    36.9
library(ggplot2)
library(dplyr)

subset_vars <- c("Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")
subset_data <- human[, subset_vars]

pairs(subset_data)

summary(subset_data)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

By looking at the data above we ca see that Expected years of education (Edu.Exp) and life expectancy (Life.Exp) present with aa positive linear trend. It suggests that with the progression of Edu.Exp there is a correlation with Life.Exp. In other words the higher the education level, the higher the life expectancy. No clear assosiation is visible with other variables.

#Principal component analysis

pca_result <- prcomp(human[, c("Edu2.FM", "Labo.FM", "Edu.Exp", "Life.Exp", "GNI", "Mat.Mor", "Ado.Birth", "Parli.F")], scale = TRUE)

library(ggplot2)
library(ggrepel)

ggplot(data = as.data.frame(pca_result$x), aes(x = PC1, y = PC2)) +
  geom_point(size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "gray") +
  geom_text_repel(aes(label = rownames(as.data.frame(pca_result$x))), box.padding = 0.5, max.overlaps = Inf) +
  labs(title = "PCA Biplot") +
  theme_minimal()

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("FactoMineR")
## 
## The downloaded binary packages are in
##  /var/folders/4g/x6rsz1195b5dtvzn9lvwxp1w0000gn/T//Rtmp5wCBvB/downloaded_packages
library(FactoMineR)

human_standardized <- scale(human)


pca_result_standardized <- PCA(human_standardized, graph = FALSE)

plot(pca_result_standardized, choix = 'ind', cex = 0.8, col.hab = as.factor(rownames(human)))

The biplot shows the relationship between ountries based on on PC1 and PC2. The arrows represent original variables and their correlation. Countries close to each other show similar pattern of variation.

In the standardized PCA values variables with higher SD are emphasized. Overall standardisation is beneficial and presents with more clear data.

Variables pointing in the direction of PC1 have a strong influence on its variation. Arrow lenght indicates the strength of each variable’s contribution. Countries to the left of the axis PC1 have a lower score. On the other hand varbiales pointing in the direction of PC2 have a strong influence on its variation.

install.packages(c("FactoMineR", "factoextra"))
## 
## The downloaded binary packages are in
##  /var/folders/4g/x6rsz1195b5dtvzn9lvwxp1w0000gn/T//Rtmp5wCBvB/downloaded_packages
library(FactoMineR)
library(factoextra)

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

tea_mca_data <- tea[, c("breakfast", "tea.time", "evening", "lunch", "dinner", "always", "home", "work", 
                        "tearoom", "friends", "resto", "pub", "Tea", "How", "sugar", "how", "where", 
                        "price", "age", "sex", "SPC", "Sport", "age_Q", "frequency", "escape.exoticism", 
                        "spirituality", "healthy", "diuretic", "friendliness", "iron.absorption", 
                        "feminine", "sophisticated", "slimming", "exciting", "relaxing", 
                        "effect.on.health")]

tea_mca_data[] <- lapply(tea_mca_data, as.factor)

mca_result <- MCA(tea_mca_data)

## Warning: ggrepel: 29 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

fviz_mca_biplot(mca_result, col.ind = "cos2", col.ind.sup = "blue", 
                cex = 0.7, title = "MCA Variable Biplot")
## Warning: Duplicated aesthetics after name standardisation: size
## Warning: Duplicated aesthetics after name standardisation: size

plot.MCA(mca_result, invisible = "ind", col.ind = "cos2", col.ind.sup = "blue", 
          cex = 0.5, title = "MCA Variable Biplot")

As shown in the Biplot factors placed closest to each other have the strongest correlation with each other.


Assignment 6

(date)
## function (x) 
## {
##     UseMethod("date")
## }
## <bytecode: 0x7fd6c61d44d8>
## <environment: namespace:lubridate>
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(GGally)
library(tibble)
library(FactoMineR)
library(factoextra)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
rats <- read.table("data/rats.txt", sep = ",", header = T)

str(rats)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ time  : int  1 1 1 1 1 1 1 1 1 1 ...

Changing the categorical variables ID, Group and WD

rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
rats$WD <- factor(rats$WD)
str(rats)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ time  : int  1 1 1 1 1 1 1 1 1 1 ...

The data compares three different diet types (groups) and their effect on the rats development ie mass. The weight has been measured over 9 weeks with 11 different time points.

rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
rats$WD <- factor(rats$WD)
str(rats)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ time  : int  1 1 1 1 1 1 1 1 1 1 ...

Weight development over time in different groups

ggplot(rats, aes(x = time, y = weight, linetype=ID, col=Group)) +geom_line() + scale_linetype_manual(values = rep(1:10, times=4)) + theme(legend.position = "none") + scale_y_continuous(limits = c(min(rats$weight), max(rats$weight))) + facet_grid(. ~ Group, labeller = label_both) + xlab("time (days)")

Plotting values using standardized values. Shows the distribution better in the beginning.

rats <- rats %>% group_by(time) %>% mutate(stdweight = (weight-mean(weight))/sd(weight)) %>% ungroup()

ggplot(rats, aes(x = time, y = stdweight, linetype=ID, col=Group)) +geom_line() + scale_linetype_manual(values = rep(1:10, times=4)) + theme(legend.position = "none") + facet_grid(. ~ Group, labeller = label_both) + xlab("time (days)") + ylab("scaled weight")

As we can see the Group 1 has a much lower weight to from the very beginning, it has also some minimal increase throughout the observation period. Groups 2 and 3 have higher weight increase throughout the observation period however

rats2 <- rats %>% group_by(Group, time) %>% summarise( mean = mean(weight), se = (sd(weight)/sqrt(length(weight)))) %>% ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#average weight change over the observation period
ggplot(rats2, aes(x = time, y = mean, linetype = Group, shape = Group, col=Group)) + geom_line() + scale_linetype_manual(values = c(1,2,3)) + geom_point(size=3) + scale_shape_manual(values = c(1,2,3)) + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) + scale_y_continuous(name = "mean(weight) +/- se(weight)") + scale_x_continuous(name = "time (days)")

#total change in weight
rats3 <- rats %>% filter(time > 1) %>% group_by(Group, ID) %>% summarise(mean=mean(weight)) %>% ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(rats3, aes(x = Group, y = mean, col=Group)) + geom_boxplot() + stat_summary(fun = "mean", geom = "point", shape=23, size=2.5, fill = "white") + scale_y_continuous(name = "mean(weight), days 8-64")

rats4 <- subset(rats3, rats3$mean<550)

ggplot(rats4, aes(x = Group, y = mean, col=Group)) + geom_boxplot() + stat_summary(fun = "mean", geom = "point", shape=23, size=2.5, fill = "white") + scale_y_continuous(name = "mean(weight), days 8-64")

Group 1 has the smallest weight. Group 2 has the biggest increase in weight gain. Group 3 have the biggest weight however the gain is not as big as in group 2.

##T-test to evaluate the group differences and suitability into the linear model

rats5 <- subset(rats4, rats4$Group==1 | rats4$Group==2)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -25.399, df = 9, p-value = 1.094e-09
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  -204.0633 -170.6867
## sample estimates:
## mean in group 1 mean in group 2 
##         265.025         452.400
#Group 1 versus Group 2
rats5 <- subset(rats4, rats4$Group==1 | rats4$Group==3)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3 
##         265.025         527.500
#Group 1 versus Group 3
rats5 <- subset(rats4, rats4$Group==2 | rats4$Group==3)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -109.37769  -40.82231
## sample estimates:
## mean in group 2 mean in group 3 
##           452.4           527.5
#Group 2 versus Group 3
rats5 <- subset(rats4, rats4$Group==2 | rats4$Group==3)
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -109.37769  -40.82231
## sample estimates:
## mean in group 2 mean in group 3 
##           452.4           527.5
#Group 2 versus Group 3
rats5 <- subset(rats3, rats3$Group==2 | rats4$Group==3)
## Warning in rats3$Group == 2 | rats4$Group == 3: longer object length is not a
## multiple of shorter object length
t.test(mean ~ Group, data = rats5, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  mean by Group
## t = -0.83974, df = 5, p-value = 0.4393
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
##  -149.45079   75.85079
## sample estimates:
## mean in group 2 mean in group 3 
##           487.8           524.6
baseline <- subset(rats, rats$time==1)
rats5 <- rats3 %>% mutate(baseline = baseline$weight)

fit <- lm(mean ~ baseline + Group, data = rats5)

anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing Group 1 to Groups 2 and 3 has small p-values. Hence there is a significant difference in the weight. Comparing groups 2 and 3 shows already a smaller p -value. Moreover removing the outlines increases the p-value even more questioning the correlation.

Based on the Anova analysis it might be good to choose rats with a similar starting weight as that is the strongest predictive value.

#BPRS dataset

bprs <- read.table("data/bprs.txt", header=T, sep = ",")

str(bprs)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dose     : int  42 58 54 55 72 48 71 30 41 57 ...
bprs$treatment <- factor(bprs$treatment)
bprs$subject <- factor(bprs$subject)
bprs$week <- factor(bprs$week)
str(bprs)
## 'data.frame':    360 obs. of  4 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ week     : Factor w/ 9 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ dose     : int  42 58 54 55 72 48 71 30 41 57 ...
ggplot(bprs, aes(x = as.numeric(week), y = dose, linetype = subject, colour = treatment)) +
  geom_line(alpha = 0.5) +
  scale_linetype_manual(values = rep(1:10, times = 40), guide = FALSE) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  xlab("Week") + ylab("Dose")
## Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
## ggplot2 3.3.4.
## ℹ Please use "none" instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on this graph it is impossible to say about any correlations.

bprs_lm <- lm(dose ~ as.numeric(week) + treatment, data = bprs)

summary(bprs_lm)
## 
## Call:
## lm(formula = dose ~ as.numeric(week) + treatment, data = bprs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       48.7243     1.5627  31.179   <2e-16 ***
## as.numeric(week)  -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2         0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

As seen above time (in weeks) is the only variable strongly associated with the treatment.

Now lets create the intercept model

bprs_ref <- lmer(dose ~ as.numeric(week) + treatment + (1 | subject), data = bprs, REML = FALSE)

summary(bprs_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: dose ~ as.numeric(week) + treatment + (1 | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)       48.7243     2.0087  24.256
## as.numeric(week)  -2.2704     0.2084 -10.896
## treatment2         0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##             (Intr) as.n()
## as.nmrc(wk) -0.519       
## treatment2  -0.268  0.000

A random intercept model to allow for the different terms to show interindividual differences.

bprs_ref1 <- lmer(dose ~ as.numeric(week) + treatment + (week | subject), data = bprs, REML = FALSE)
## boundary (singular) fit: see help('isSingular')
summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: dose ~ as.numeric(week) + treatment + (week | subject)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2822.4   3012.8  -1362.2   2724.4      311 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8998 -0.5630 -0.0911  0.5802  3.3754 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr                               
##  subject  (Intercept) 65.74    8.108                                       
##           week1       12.24    3.498     0.27                              
##           week2       13.50    3.674    -0.21  0.62                        
##           week3       22.45    4.738    -0.25  0.83  0.88                  
##           week4       33.81    5.815    -0.59  0.60  0.56  0.84            
##           week5       42.18    6.495    -0.72  0.39  0.36  0.67  0.96      
##           week6       53.27    7.299    -0.64  0.44  0.30  0.65  0.96  0.99
##           week7       51.17    7.153    -0.43  0.56  0.21  0.62  0.91  0.92
##           week8       51.65    7.187    -0.30  0.65  0.22  0.65  0.88  0.87
##  Residual             92.73    9.630                                       
##             
##             
##             
##             
##             
##             
##             
##             
##   0.96      
##   0.92  0.99
##             
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)       46.2814     2.1363  21.664
## as.numeric(week)  -2.1561     0.2893  -7.453
## treatment2         0.5722     1.0150   0.564
## 
## Correlation of Fixed Effects:
##             (Intr) as.n()
## as.nmrc(wk) -0.665       
## treatment2  -0.238  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

A random intercept model to allow for the different terms to show interindividual differences.

bprs_ref2 <- lmer(dose ~ as.numeric(week) + treatment + (week | subject) + (week | treatment), data = bprs, REML = FALSE)
## boundary (singular) fit: see help('isSingular')
summary(bprs_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: dose ~ as.numeric(week) + treatment + (week | subject) + (week |  
##     treatment)
##    Data: bprs
## 
##      AIC      BIC   logLik deviance df.resid 
##   2902.4   3267.7  -1357.2   2714.4      266 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0574 -0.5318 -0.0809  0.5350  3.4227 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev. Corr                               
##  subject   (Intercept) 52.74236 7.2624                                      
##            week1       11.76718 3.4303    0.31                              
##            week2       10.83352 3.2914    0.05  0.69                        
##            week3       17.56874 4.1915   -0.02  0.92  0.85                  
##            week4       25.59230 5.0589   -0.46  0.66  0.39  0.78            
##            week5       30.80826 5.5505   -0.59  0.46  0.13  0.57  0.96      
##            week6       44.59487 6.6779   -0.53  0.47  0.08  0.55  0.95  1.00
##            week7       51.16336 7.1529   -0.41  0.55  0.09  0.58  0.94  0.98
##            week8       53.08785 7.2861   -0.34  0.63  0.14  0.64  0.95  0.96
##  treatment (Intercept)  0.09326 0.3054                                      
##            week1        1.31922 1.1486   -1.00                              
##            week2        7.13953 2.6720   -0.50  0.50                        
##            week3        6.37209 2.5243   -0.50  0.50  1.00                  
##            week4        2.19482 1.4815   -0.17  0.17  0.94  0.94            
##            week5        5.40404 2.3247    0.10 -0.10  0.81  0.81  0.96      
##            week6        0.88994 0.9434    0.88 -0.88 -0.04 -0.03  0.31  0.56
##            week7        8.24296 2.8711    0.06 -0.06 -0.89 -0.90 -0.99 -0.99
##            week8       17.77831 4.2164   -0.14  0.14 -0.79 -0.79 -0.95 -1.00
##  Residual              89.09235 9.4389                                      
##             
##             
##             
##             
##             
##             
##             
##             
##   0.99      
##   0.97  1.00
##             
##             
##             
##             
##             
##             
##             
##  -0.41      
##  -0.59  0.98
##             
## Number of obs: 360, groups:  subject, 20; treatment, 2
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)       49.9555     2.3406  21.343
## as.numeric(week)  -2.5854     0.3179  -8.132
## treatment2         0.8249     0.9996   0.825
## 
## Correlation of Fixed Effects:
##             (Intr) as.n()
## as.nmrc(wk) -0.710       
## treatment2  -0.214  0.000
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Anova analysis of the two models

anova(bprs_ref, bprs_ref1)
## Data: bprs
## Models:
## bprs_ref: dose ~ as.numeric(week) + treatment + (1 | subject)
## bprs_ref1: dose ~ as.numeric(week) + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## bprs_ref     5 2748.7 2768.1 -1369.4   2738.7                     
## bprs_ref1   49 2822.4 3012.8 -1362.2   2724.4 14.288 44          1
anova(bprs_ref1, bprs_ref2)
## Data: bprs
## Models:
## bprs_ref1: dose ~ as.numeric(week) + treatment + (week | subject)
## bprs_ref2: dose ~ as.numeric(week) + treatment + (week | subject) + (week | treatment)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## bprs_ref1   49 2822.4 3012.8 -1362.2   2724.4                     
## bprs_ref2   94 2902.4 3267.7 -1357.2   2714.4 10.036 45          1
ggplot(bprs, aes(x = as.numeric(week), y = dose, linetype = subject, colour = treatment)) +
  geom_line(alpha = 0.5) +
  scale_linetype_manual(values = rep(1:10, times = 40), guide = FALSE) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  xlab("Week") +
  ylab("Dose")

fitted_values <- fitted(bprs_ref1)
bprs <- mutate(bprs, fitted = fitted_values)

library(ggplot2)

ggplot(bprs, aes(x = as.numeric(week), y = fitted, linetype = subject, colour = treatment)) +
  geom_line(alpha = 0.5) +
  scale_linetype_manual(values = rep(1:10, times = 40), guide = FALSE) +
  guides(colour = guide_legend(override.aes = list(alpha = 1))) +
  xlab("Week") +
  ylab("Fitted Values")